So, the point is now how to actually compute a policy that is ideally optimal.
And we have two ways of going about that.
Either we look at the utilities of the individual states and derive the actions from them, or we try to compute a policy as a whole.
Value iteration does the first thing.
The basic idea is that the expected sum of rewards for some sequence of actions is the current reward plus my discount factor times the expected sum of the rewards afterwards.
You can basically derive that simply from the definition of the utility of states.
Where is it?
This thing here.
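Written out, the definition being pointed at is roughly the following; the notation is assumed here, following the usual MDP conventions:

```latex
% Utility of a state: expected discounted sum of rewards from that state on
% (standard MDP definition, notation assumed).
U(s) = E\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(S_t) \;\middle|\; S_0 = s \right]
% Pulling the first reward out of the sum gives the recursive form described above:
U(s) = R(s) + \gamma \, E\!\left[\, U(S_1) \;\middle|\; S_0 = s \right]
```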
This leads to this giant Bellman equation which is going to be very important in the
future.
We say the utility of a state s is the reward at that state, plus gamma times the maximum over the actions of the sum, over the possible successor states, of their utility times the transition model.
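Written out in symbols, that is the standard Bellman equation (notation assumed here):

```latex
% Bellman equation for the utility of a state s (standard notation assumed):
% reward at s, plus gamma times the best achievable expected successor utility.
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
```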
If you do this, you note that the utility at state (1,1), which is down there, this is a bit confusing, is the reward, which is just the usual cost of staying alive, plus gamma times the maximum of all of those expected successor utilities.
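For the 4x3 grid world this instantiates roughly as follows; the concrete numbers (a living reward of -0.04 and the 0.8/0.1/0.1 noisy-move transition model) are the usual textbook values and are assumed here, not read off the slide:

```latex
% Bellman equation at state (1,1) of the 4x3 grid world.
% The living reward -0.04 and the 0.8/0.1/0.1 transition model are assumed values.
U(1,1) = -0.04 + \gamma \max \begin{cases}
  0.8\,U(1,2) + 0.1\,U(2,1) + 0.1\,U(1,1) & \text{(Up)}\\
  0.9\,U(1,1) + 0.1\,U(1,2)               & \text{(Left)}\\
  0.9\,U(1,1) + 0.1\,U(2,1)               & \text{(Down)}\\
  0.8\,U(2,1) + 0.1\,U(1,2) + 0.1\,U(1,1) & \text{(Right)}
\end{cases}
```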
We can use that to compute the utility for each state which is nice.
The annoying thing is that we have one equation for each state, of course, but there is also this pesky maximum in here, which is annoying because it makes the whole thing nonlinear.
That means we can't use linear algebra, which means we can't just do simple matrix manipulation and all of that.
That makes things a bit difficult to compute, but still, it's a first approximation.
So now how do we use this to get at an algorithm?
Well the basic idea is simple.
You just assign some utility to each node at the beginning and then you use the Bellman
equation to basically update the utilities of the neighboring states.
You use it again to update the utilities of those neighboring states and all of that.
That makes everything locally consistent, in the sense that it makes the utility of each state consistent with the utilities of its neighboring states, and if it does that at each state, it does it everywhere.
That's the point here: if it's everywhere locally consistent, you get global optimality.
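Why local consistency everywhere gives the globally correct utilities is not spelled out here, but the standard argument, assumed rather than stated in the lecture, is that the Bellman update is a contraction:

```latex
% Bellman update B applied to a utility vector U (standard definition assumed):
(B U)(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
% B is a contraction in the max norm for gamma < 1, so repeated application
% converges to its unique fixed point, which is exactly the true utilities:
\lVert B U - B U' \rVert_{\infty} \;\le\; \gamma \, \lVert U - U' \rVert_{\infty}
```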
The iteration algorithm itself is correspondingly rather straightforward.
You have all of the stuff that you're given anyway.
You start with a vector of utilities, with one entry for each state.
Delta is just kept for finding out when to actually stop.
Now what do you do?
You update the utility for each state according to the Bellman equation, this joint thing.
You take the reward, and you add the discount factor times the maximum over the actions of the expected utilities of all the neighboring states.
Then you just check whether it has changed significantly.
If it has not changed significantly, that means we're close enough.
So this is all just termination, a termination heuristic basically.
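As a minimal sketch of what that loop could look like in code; the MDP interface here (states, actions, P, R) is an assumption for illustration, not the lecture's own code:

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """Minimal value iteration sketch (assumed MDP interface, for illustration only).

    P[s][a] is a list of (probability, successor) pairs, R[s] is the reward at s.
    Assumes every state has at least one action (terminal states can self-loop).
    """
    U = {s: 0.0 for s in states}      # assign some utility to every state at the beginning
    while True:
        delta = 0.0                   # largest change in this sweep, used only for termination
        U_new = {}
        for s in states:
            # Bellman update: reward plus discounted best expected successor utility
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in actions(s)
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:               # stop once nothing changes much anymore
            return U
```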
Now, if you do this, you can compute a full utility for each state in your problem, and once you have the full utility for each state, you can derive a policy from that.
That's the basic idea.
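The step from utilities to a policy could then look like this, again under the same assumed MDP interface as in the sketch above:

```python
def derive_policy(states, actions, P, U):
    """Extract a policy from computed utilities (assumed interface, for illustration)."""
    policy = {}
    for s in states:
        # pick the action whose expected successor utility is maximal
        policy[s] = max(
            actions(s),
            key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]),
        )
    return policy
```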
Now, looking at this graph, something funny happens.
This is exactly for this four-by-three grid here.
So imagine what the algorithm does.